34 research outputs found

    Approximately Minwise Independence with Twisted Tabulation

    A random hash function $h$ is $\varepsilon$-minwise if for any set $S$, $|S|=n$, and element $x\in S$, $\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n$. Minwise hash functions with low bias $\varepsilon$ have widespread applications within similarity estimation. Hashing from a universe $[u]$, the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA'13] makes $c=O(1)$ lookups in tables of size $u^{1/c}$. Twisted tabulation was invented to get good concentration for hashing based sampling. Here we show that twisted tabulation yields $\tilde O(1/u^{1/c})$-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79] $\tilde O(1/u^{1/c})$-minwise hashing requires $\Omega(\log u)$-independence [Indyk SODA'99]. Pǎtraşcu and Thorup [STOC'11] had shown that simple tabulation, using same space and lookups yields $\tilde O(1/n^{1/c})$-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner bypassing a complicated induction argument. Comment: To appear in Proceedings of SWAT 201
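
As a rough illustration of the definition above, the following sketch estimates $\Pr[h(x)=\min h(S)]$ empirically. It uses simple tabulation (XOR of per-character table lookups), not the twisted variant analysed in the paper, and the names `make_simple_tabulation` and `minwise_frequency` as well as the parameters (two 8-bit characters, 32-bit outputs) are assumptions made only for this example.

```python
import random

def make_simple_tabulation(c=2, char_bits=8, out_bits=32, rng=None):
    """Return a simple tabulation hash for keys of c characters of char_bits bits each."""
    rng = rng or random.Random()
    # One table of random out_bits-bit values per character position.
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        value = 0
        for i in range(c):
            # Look up the i-th character of the key and XOR the results together.
            value ^= tables[i][(x >> (i * char_bits)) & mask]
        return value

    return h

def minwise_frequency(S, x, trials=2000, seed=0):
    """Fraction of freshly drawn hash functions under which x attains the minimum over S."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_simple_tabulation(rng=rng)
        if h(x) == min(h(y) for y in S):
            hits += 1
    return hits / trials
```

For a set of $n$ keys, an $\varepsilon$-minwise family should give a frequency close to $1/n$; calling `minwise_frequency(list(range(0, 1000, 100)), 300)` is one way to check this for $n=10$.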

    One Table to Count Them All: Parallel Frequency Estimation on Single-Board Computers

    Sketches are probabilistic data structures that can provide approximate results within mathematically proven error bounds while using orders of magnitude less memory than traditional approaches. They are tailored for streaming data analysis, even on architectures with limited memory such as single-board computers, which are widely used for IoT and edge computing. Since these devices offer multiple cores, efficient parallel sketching schemes allow them to manage high volumes of data streams. However, since their caches are relatively small, careful parallelization is required. In this work, we focus on the frequency estimation problem and evaluate the performance of a high-end server, a 4-core Raspberry Pi and an 8-core Odroid. As a sketch, we employed the widely used Count-Min Sketch. To hash the stream in parallel and in a cache-friendly way, we applied a novel tabulation approach and rearranged the auxiliary tables into a single one. To parallelize the process efficiently, we modified the workflow and applied a form of buffering between hash computations and sketch updates. Today, many single-board computers have heterogeneous processors that combine slow and fast cores. To utilize all these cores to their full potential, we proposed a dynamic load-balancing mechanism which significantly increased the performance of frequency estimation. Comment: 12 pages, 4 figures, 3 algorithms, 1 table, submitted to EuroPar'1
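
For readers unfamiliar with the Count-Min Sketch mentioned above, here is a minimal sequential sketch of the data structure. It is not the paper's parallel, single-table tabulation scheme: the row hashes are plain multiply-add functions modulo a Mersenne prime, and the class name, default sizes and integer-only keys are assumptions chosen for the example.

```python
import random

class CountMinSketch:
    """Sequential Count-Min Sketch for non-negative integer keys (illustrative only)."""

    def __init__(self, width=256, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.prime = (1 << 61) - 1  # Mersenne prime used as the hash modulus
        # One (a, b) pair per row; row hash is ((a*key + b) mod prime) mod width.
        self.coeffs = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        return [((a * key + b) % self.prime) % self.width for a, b in self.coeffs]

    def update(self, key, count=1):
        # Increment one counter in every row.
        for row, col in zip(self.rows, self._columns(key)):
            row[col] += count

    def estimate(self, key):
        # With non-negative updates, the minimum over rows never underestimates.
        return min(row[col] for row, col in zip(self.rows, self._columns(key)))
```

With only non-negative updates, `estimate(key)` is always at least the true frequency of `key`; collisions can only inflate a counter, which is why the minimum across rows is returned.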

    Quicksort, Largest Bucket, and Min-Wise Hashing with Limited Independence

    Randomized algorithms and data structures are often analyzed under the assumption of access to a perfect source of randomness. The most fundamental metric used to measure how "random" a hash function or a random number generator is, is its independence: a sequence of random variables is said to be $k$-independent if every variable is uniform and every size $k$ subset is independent. In this paper we consider three classic algorithms under limited independence. We provide new bounds for randomized quicksort, min-wise hashing and largest bucket size under limited independence. Our results can be summarized as follows.
    - Randomized quicksort. When pivot elements are computed using a $5$-independent hash function, Karloff and Raghavan, J.ACM'93 showed $O(n \log n)$ expected worst-case running time for a special version of quicksort. We improve upon this, showing that the same running time is achieved with only $4$-independence.
    - Min-wise hashing. For a set $A$, consider the probability of a particular element being mapped to the smallest hash value. It is known that $5$-independence implies the optimal probability $O(1/n)$. Broder et al., STOC'98 showed that $2$-independence implies it is $O(1/\sqrt{|A|})$. We show a matching lower bound as well as new tight bounds for $3$- and $4$-independent hash functions.
    - Largest bucket. We consider the case where $n$ balls are distributed to $n$ buckets using a $k$-independent hash function and analyze the largest bucket size. Alon et al., STOC'97 showed that there exists a $2$-independent hash function implying a bucket of size $\Omega(n^{1/2})$. We generalize the bound, providing a $k$-independent family of functions that imply size $\Omega(n^{1/k})$.
    Comment: Submitted to ICALP 201

    Picture-Hanging Puzzles

    We show how to hang a picture by wrapping rope around n nails, making a polynomial number of twists, such that the picture falls whenever any k out of the n nails get removed, and the picture remains hanging when fewer than k nails get removed. This construction makes for some fun mathematical magic performances. More generally, we characterize the possible Boolean functions characterizing when the picture falls in terms of which nails get removed as all monotone Boolean functions. This construction requires an exponential number of twists in the worst case, but exponential complexity is almost always necessary for general functions. Comment: 18 pages, 8 figures, 11 puzzles. Journal version of FUN 2012 pape
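
The abstract does not spell out the construction, so the sketch below covers only the special case k = 1 (the picture falls when any single nail is removed), under the usual assumption that a wrapping is a word in the free group on the nails and that removing a nail deletes its letters. The balanced commutator recursion and the function names `hang`, `reduce_word` and `falls` are illustrative choices, not taken from the listing.

```python
def inverse(word):
    """Inverse of a free-group word given as a list of (nail, exponent) letters."""
    return [(nail, -exp) for nail, exp in reversed(word)]

def hang(lo, hi):
    """Wrapping around nails lo..hi that falls if any single nail is removed."""
    if lo == hi:
        return [(lo, 1)]  # one clockwise turn around a single nail
    mid = (lo + hi) // 2
    a, b = hang(lo, mid), hang(mid + 1, hi)
    # Commutator a b a^-1 b^-1 of the two halves; word length grows quadratically in n.
    return a + b + inverse(a) + inverse(b)

def reduce_word(word):
    """Cancel adjacent inverse letters until the word is freely reduced."""
    stack = []
    for nail, exp in word:
        if stack and stack[-1] == (nail, -exp):
            stack.pop()
        else:
            stack.append((nail, exp))
    return stack

def falls(word, removed):
    """The picture falls iff the word becomes trivial once removed nails are deleted."""
    return not reduce_word([(n, e) for n, e in word if n not in removed])
```

For example, `hang(1, 2)` is the word x1 x2 x1^-1 x2^-1: it stays on the wall with both nails present, and `falls(hang(1, 2), {1})` holds because the remaining letters x2 x2^-1 cancel.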

    Triangle Counting in Dynamic Graph Streams

    Estimating the number of triangles in graph streams using a limited amount of memory has become a popular topic in the last decade. Different variations of the problem have been studied, depending on whether the graph edges are provided in an arbitrary order or as incidence lists. However, with a few exceptions, the algorithms have considered insert-only streams. We present a new algorithm estimating the number of triangles in dynamic graph streams where edges can be both inserted and deleted. We show that our algorithm achieves better time and space complexity than previous solutions for various graph classes, for example sparse graphs with a relatively small number of triangles. Also, for graphs with constant transitivity coefficient, a common situation in real graphs, this is the first algorithm achieving constant processing time per edge. The result is achieved by a novel approach combining sampling of vertex triples and sparsification of the input graph. In the course of the analysis of the algorithm we present a lower bound on the number of pairwise independent 2-paths in general graphs which might be of independent interest. At the end of the paper we discuss lower bounds on the space complexity of triangle counting algorithms that make no assumptions on the structure of the graph. Comment: New version of a SWAT 2014 paper with improved result

    Dynamic Compressed Strings with Random Access

    We consider the problem of storing a string S in dynamic compressed form, while permitting operations directly on the compressed representation of S: access a substring of S; replace, insert or delete a symbol in S; count how many occurrences of a given symbol appear in any given prefix of S (called rank operation) and locate the position of the ith occurrence of a symbol inside S (called select operation). We discuss the time complexity of several combinations of these operations along with the entropy space bounds of the corresponding compressed indexes. In this way, we extend or improve the bounds of previous work by Ferragina and Venturini [TCS, 2007], Jansson et al. [ICALP, 2012], and Nekrich and Navarro [SODA, 2013]
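
To make the rank and select operations described above concrete, here is a naive reference on an ordinary uncompressed Python string. It only pins down the semantics; it is linear-time and has nothing to do with the compressed, dynamic structures the paper studies. The 0-based positions and the function names are conventions assumed for this example.

```python
def rank(s, symbol, i):
    """Number of occurrences of symbol in the prefix s[:i]."""
    return s[:i].count(symbol)

def select(s, symbol, j):
    """0-based position of the j-th occurrence (j >= 1) of symbol in s, or -1 if absent."""
    seen = 0
    for pos, ch in enumerate(s):
        if ch == symbol:
            seen += 1
            if seen == j:
                return pos
    return -1
```

The two operations are inverse in the sense that `rank(s, c, select(s, c, j) + 1) == j` whenever the j-th occurrence of `c` exists.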

    Yes, There is an Oblivious RAM Lower Bound!

    An Oblivious RAM (ORAM) introduced by Goldreich and Ostrovsky [JACM'96] is a (possibly randomized) RAM, for which the memory access pattern reveals no information about the operations performed. The main performance metric of an ORAM is the bandwidth overhead, i.e., the multiplicative factor extra memory blocks that must be accessed to hide the operation sequence. In their seminal paper introducing the ORAM, Goldreich and Ostrovsky proved an amortized $\Omega(\lg n)$ bandwidth overhead lower bound for ORAMs with memory size $n$. Their lower bound is very strong in the sense that it applies to the "offline" setting in which the ORAM knows the entire sequence of operations ahead of time. However, as pointed out by Boyle and Naor [ITCS'16] in the paper "Is there an oblivious RAM lower bound?", there are two caveats with the lower bound of Goldreich and Ostrovsky: (1) it only applies to "balls in bins" algorithms, i.e., algorithms where the ORAM may only shuffle blocks around and not apply any sophisticated encoding of the data, and (2) it only applies to statistically secure constructions. Boyle and Naor showed that removing the "balls in bins" assumption would result in super linear lower bounds for sorting circuits, a long standing open problem in circuit complexity. As a way of circumventing this barrier, they also proposed a notion of an "online" ORAM, which is an ORAM that remains secure even if the operations arrive in an online manner. They argued that most known ORAM constructions work in the online setting as well. Our contribution is an $\Omega(\lg n)$ lower bound on the bandwidth overhead of any online ORAM, even if we require only computational security and allow arbitrary representations of data, thus greatly strengthening the lower bound of Goldreich and Ostrovsky in the online setting. Our lower bound applies to ORAMs with memory size $n$ and any word size $r \geq 1$. The bound therefore asymptotically matches the known upper bounds when $r = \Omega(\lg^2 n)$

    Lower Bounds for Multi-Server Oblivious RAMs

    In this work, we consider the construction of oblivious RAMs (ORAM) in a setting with multiple servers where the adversary may corrupt a subset of the servers. We present an $\Omega(\log n)$ overhead lower bound for any $k$-server ORAM that limits any PPT adversary to distinguishing advantage at most $1/4k$ when only one server is corrupted. In other words, if one insists on negligible distinguishing advantage, then multi-server ORAMs cannot be faster than single-server ORAMs even with polynomially many servers of which only one unknown server is corrupted. Our results apply to ORAMs that may err with probability at most $1/128$ as well as scenarios where the adversary corrupts larger subsets of servers. We also extend our lower bounds to other important data structures including oblivious stacks, queues, deques, priority queues and search trees

    Lower Bounds for Encrypted Multi-Maps and Searchable Encryption in the Leakage Cell Probe Model

    Encrypted multi-maps (EMMs) enable clients to outsource the storage of a multi-map to a potentially untrusted server while maintaining the ability to perform operations in a privacy-preserving manner. EMMs are an important primitive as they are an integral building block for many practical applications such as searchable encryption and encrypted databases. In this work, we formally examine the tradeoffs between privacy and efficiency for EMMs. Currently, all known dynamic EMMs with constant overhead reveal whether two operations are performed on the same key or not, a leakage that we denote as the global key-equality pattern. In our main result, we present strong evidence that the leakage of the global key-equality pattern is inherent for any dynamic EMM construction with $O(1)$ efficiency. In particular, we consider the slightly smaller leakage of the decoupled key-equality pattern, where leakage of key-equality between update and query operations is decoupled and the adversary only learns whether two operations of the same type are performed on the same key or not. We show that any EMM with at most decoupled key-equality pattern leakage incurs $\Omega(\log n)$ overhead in the leakage cell probe model. This is tight as there exist ORAM-based constructions of EMMs with logarithmic slowdown that leak no more than the decoupled key-equality pattern (and actually, much less). Furthermore, we present stronger lower bounds showing that encrypted multi-maps that leak at most the decoupled key-equality pattern but are able to perform one of either the update or query operations in the plaintext still require $\Omega(\log n)$ overhead. Finally, we extend our lower bounds to show that dynamic, response-hiding searchable encryption schemes must also incur $\Omega(\log n)$ overhead even when one of either the document updates or searches may be performed in the plaintext